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METHODS, SOFTWARE ARRANGEMENTS, STORAGE MEDIA, 

AND SYSTEMS FOR GENOTYPING OR HAPLOTYPING 
POLYMORPHIC GENETIC LOCI OR STRAIN IDENTIFICATION 

5 

CROSS REFERENCE TO RELATED APPLICATION 

This application claims priority from U.S. Patent Application Serial 
No. 60/492,210 filed on August 1, 2003, the entire disclosure of which is incorporated 
herein by reference. 

•V ■ 

10 FIELD OF THE INVENTION 

. fThe present invention relates generally to systems, methods, and 
software arrangements for genotyping or haplotyping polymorphic genetic loci or 
strain identification. 

BACKGROUND OF THE INVENTION 

i5 The human leukocyte antigen ("HLA") region on chromosome 6 is 

V 

highly polymorphic. In particular, the sequence of this region varies from person to 
person. See, e.g., Consolandi et al 9 Human Immunology 2003,64:168. 
Approximately 1,750 different sequence variants or "alleles" have been identified to 
date at the HLA locus. There are many biological implications of the high degree of 
20 heterogeneity of this region. For example, the presence or absence of a given HLA 
allele or "HLA type" may predict the presence or absence of diseases, dictate the 
course of treatment for a patient, or, most notably, determine the compatibility of a 
potential transplant recipient with the donor organ or bone marrow. 

One of the approaches to finding the right allele is to design a 
25 microarray experiment that provides the allele as an answer. In fact, HLA typing by 
sequence hybridization with sequence-specific oligonucleotide probes ("SSOP") is 
currently practiced by the National Marrow Donor Program ("NMDP") for donor- 
recipient matching, along with more traditional serological-based methods. See, e.g., 
Cao et ai 9 Reviews in Immunogenetics 1999, 1:177; and Noreen et ah, Tissue 
30 Antigens 2001,57:221. In a format that is popular in many current test methods, the 
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DNA samples to be classified are amplified with locus-specific primers, and spotted 
onto the microarray chips, thus resulting in multiple copies of identical chips; each 
chip is then hybridized to a different probe. See Balazs et aL, Human Immunology 
2001,62:850; Consolandi et al 9 Human Immunology 2003, 64:168. This 
5 methodology necessitates a new design process every time a new set of patient 
samples must be classified. Moreover, most of the currently used techniques are both 
time-consuming and lack optimality. 

The system, process, storage medium and software arrangement 
according to one exemplary embodiment of present application provides a graph 

10 model on the set of potential probes in which the HLA typing problem is formulated 
mathematically as an optimization problem. According to the present application, it is 
also possible to utilize an algorithm for solving the optimization problem. The 
processes of translating the typing problem to the graph model and translating the 
optimizing probe set back to an experimental design for HLA typing are also 

15 described. Extensions of the graph model to more detailed physical models are 
discussed as well. 

SUMMARY OF THE INVENTION 

Embodiments of systems, processes, storage media and software 
arrangements for genotyping or haplotyping the polymorphic genetic loci or strain 

20 identification according to the present invention may optimize the design of one or 
more microarrays, each containing a set of oligonucleotides that are capable of 
detecting known genotypes or haplotypes at given polymorphic genetic loci. This can 
be done by optimizing the set of oligonucleotides to be incorporated into the 
microarrays and/or by optimizing the arrangement of a set of oligonucleotides on the 

25 microarrays. Such optimization may also be achieved through the application of one 
or more optimization algorithms. The present invention may be useful in typing 
individuals at the HLA loci or other polymorphic genetic loci, and/or may be 
employed to quickly identify viral or bacterial pathogens from which genome 
sequence information is available. 
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In contrast to the conventional systems and methods according to the 
present invention, the sequence-specific probes may be placed on a microarray chip, 
and each patient sample can be applied to the chip to allow the hybridization with one 
or more of the chip-bound probes to occur. With an appropriate selection of probes, 
5 the same chip can be used for all classifications. However, the number of probes to 
be used, their sequence compositions, and their arrangement on the chip are some of 
the design variables that should preferably be considered in preparing the microarray. 
A general solution to solving these design problems is preferably one that allows the 
"recognition" of all existing alleles at a target locus, and/or that can decide that the 

10 given DNA sequence contains an allele that is not in the "known" list. Such an allele 
may be a new, previously unknown allele, or one of the very rare alleles that occur so 
infrequently that they are not considered HLA types.For example, an exemplary 
embodiment of the present invention is directed toward systems, processes, storage 
media and software arrangements for genotyping or haplotyping a DNA sample at one 

15 or more polymorphic loci contained in the sample through the use of microarray pA, 
which can be defined by a set of oligonucleotide hybridization probes configured on 
the surface of the microarray in a two-dimensional arrangement. The process of 
querying a given polymorphic locus in a DNA sample (hereafter referred to as a 
"target" sequence) by a hybridization experiment can be denoted by the expression 

20 (T),yA k )-*D->fj (1) 

where 7} is the true allele contained in the target sequence, juAk is the microarray used 
in the query, D is the data output of the hybridization experiment, and fj is the allele 
inferred from the data. Both processes in equation (1) are described below. 

The problem of genotyping or haplotyping then can be formulated as that of 
25 designing, e.g., the "best" microarray, namely, the set and arrangement of 
oligonucleotide probes that "works" for all known alleles (i.e.,Vj). In the notation 
employed herein, this means finding fiAk which solves the optimization 
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min 2>,4V, 1 (2) 

typej 

min J>, Pr(r y ^f.), 



Here, n x is the indicator function 



|\ if Xistrue\ 
|0, otherwise J 



and Wy is the weight assigned to type j. Initially, wj can be set such that w. =1 V/ . At 

5 a later point it may be desirable to weigh different HLA types differently, based on 
the frequency of their occurrence in human population or some other criteria. 

According to another exemplary embodiment of the present invention, 
a pseudocode can be provided to select an optimal set of oligonucleotide probes to be 
incorporated into the one or more microarray devices to be used to genotype or 

10 haplotype polymorphic loci or identify the strain present in a DNA sample. 
According to this exemplary embodiment, vertex boosting weights (initially set to 
probe information weights) can be used to define a probability distribution on a vertex 
set present in a graphical representation of "response vectors" derived from each 
potential probe sequence. On each iteration of a boosting loop, a random subset of a 

15 specified size can be selected according to the current probability distribution. All 
edges in the induced subgraph on this random subset are broken, with one of the 
terminal vertices removed. The boosting weights of the elements of the subset can 
then be modified so that the vertices that stayed iii the subset are more likely, and the 
vertices that were thrown out are less likely to be chosen on the next iteration. The 

20 boosting loop may be terminated after a predetermined number of iterations have been 
performed without further improvement to the list of top independent sets. In 
addition, the boosting loop can be restarted several times with original probe 
information weights, which prevents convergence to a local optimum. 

The exemplary embodiments of the systems, processes, storage media 
25 and software arrangements of the present invention are generally more beneficial in 
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comparison to conventional methods in that they require fewer probes, thereby 
minimizing the cost associated with the use of such probes. The exemplary 
embodiments of the systems, processes, storage media and software arrangements of 
the present invention are also preferable to conventional methods in that they can 
5 minimize competition among neighboring probes, thereby reducing or eliminating the 
occurrence of systematic biases in the error process. 

For a better understanding of the present invention, together with other 
and further objects, reference is made to the following description, taken in 
conjunction with the accompanying drawings, and its scope will be pointed out in the 
1 0 appended claims. 

DETAILED DESCRIPTION OF THE DRAWINGS 

For a more complete understanding of the present invention and its 
advantages, reference is now made to the following description, taken in conjunction 
with the accompanying drawings, in which: 

15 Figure 1 A is an exemplary embodiment of a system according to the 

present invention for determining an optimal set of oligonucleotide probes for use in a 
microarray designed to perform genotyping or haplotyping of polymorphic genetic 
loci; 

Figure IB is an exemplary embodiment of a procedure according to the 
20 present invention for determining an optimal set of oligonucleotide probes for use in a 
microarray designed to perform genotyping or haplotyping of polymorphic genetic 
loci; 

Figure 2 is a first exemplary embodiment of a probe selection 
procedure (Shown in Figure IB) of the present invention for determining an optimal 
25 set of oligonucleotide probes for use in a microarray designed to perform genotyping 
or haplotyping of polymorphic genetic loci; 

Figure 3 is a second exemplary embodiment of the probe selection 
procedure shown in Figure IB of the present invention for determining an optimal set 
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of oligonucleotide probes for use in a microarray designed to perform genotyping or 
haplotyping of polymorphic genetic loci, in which exemplary embodiments of certain 
steps of Figure 2 are further illustrated. 

DETAILED DESCRIPTION OF THE INVENTION 

5 An exemplary embodiment of the present invention is directed to 

systems, processes, storage media and software arrangements for genotyping or 
haplotyping polymorphic genetic loci (e.g., an HLA locus, in a subject) or strain 
identification. HLA haplotyping or genotyping may assist in, e.g., a prediction of a 
presence or absence of diseases, selecting a course of treatment for a patient, and/or 
10 determining the compatibility of a potential transplant recipient with the donor organ 
or bone marrow. The systems, processes, storage media and software arrangements 
of the present invention also may be useful for e.g., identifying the presence of an 
unknown pathogen, including but not limited to a virus or a bacterium, in a sample. 

By providing a relatively rapid, inexpensive, and potentially highly 
15 accurate way of genotyping or haplotyping polymorphic genetic loci or strain 
identification, the exemplary systems, processes, storage media and software 
arrangements of the present invention also can be useful in elucidating 
genotype/phenotype correlations in complex genetic disorders, i.e., those in which 
more than one gene may play a significant role, or in genetic diseases characterized 
20 by the presence of a relatively large number of genotypes or haplotypes at 
polymorphic loci. The knowledge obtained from the exemplary systems, processes, 
storage media, and software arrangements of the present invention therefore may 
assist in facilitating the diagnosis, treatment, and prognosis of individuals bearing a 
given phenotype. 

25 Figure 1A shows an exemplary embodiment of a system according to 

the present invention for determining a particular set of oligonucleotide probes for a 
use in a microarray designed to perform genotyping and/or haplotyping of at least one 
polymorphic genetic locus. For example, the system includes a processing 
arrangement 10 (e.g., a computer), which stores a computer program on a storage 

30 arrangement 15 (e.g., memory, hard drive, etc.) to execute the exemplary techniques 



6 



WO 2005/013091 



PCT/US2004/024766 



35950-PCT- 067691.0224 

described herein below. In particular, the computer program, when executed by the 
processing arrangement 10, causes the processing arrangement to obtain known 
sequences representing classes, e.g., HLA allele sequences, from an external source 
20 or any source (as described below). Then the processing arrangement 10 applies a 
5 probe selection technique according to the present invention. Thereafter, the 
processing arrangement 10 outputs the results of the execution of this technique. 

Figure IB illustrates a first exemplary embodiment of a process for 
determining an optimal set of oligonucleotide probes for use in a microarray designed 
to perform genotyping or haplotyping of polymorphic genetic loci or strain 

10 identification which can use the exemplary system shown in Figure 1A. In this 
exemplary embodiment, the process executes a first step 100 in which nucleotide 
sequences of known genotypes or haplotypes at predetermined polymorphic loci, or 
strain-specific sequences, may be obtained. Such information can be obtained from a 
number of sources. For example, this data can be obtained from genetic databases 

15 including, but not limited to, GenBank and/or other databases maintained by public or 
private (commercial and non-commercial) institutions. The databases suitable for use 
with the systems, processes, storage media and software arrangements of the present 
invention will be readily apparent to those of ordinary skill in the art of DNA 
sequencing. Alternatively, such information may be obtained from the prior direct 

20 sequencing of one or more test samples. 

Once a set of known sequences corresponding to genotypes or 
haplotypes at polymorphic genetic loci or strains is assembled, the probe selection 
procedure 200 of the present invention may be applied to this set to identify an 
optimal subset of oligonucleotide probes required to genotype or haplotype the 

25 polymorphic genetic loci or identify the strain. This can be executed by the 
processing arrangement of Figure 1A. The optimization of the set of oligonucleotide 
probes may include, but is not limited to, the determination of the overall lengths of 
the various probes, their sequence compositions, and the minimum number of probes 
required to identify all selected target genotypes or haplotypes or strains. Many if not 

30 all selected target genotypes or haplotypes or strains may include all known genotypes 
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or haplotypes at predetermined polymorphic genetic loci or strains, or a subset 
thereof. For example, the identification may be limited only to those alleles that are 
most prevalent in the population from which the DNA sample to be genotyped and/or 
haplotyped has been obtained. 

5 In another step of the process for determining the optimal set of 

oligonucleotide probes of Figure IB, an optimal subset of oligonucleotide probes 
preferred for genotyping and/or haplotyping polymorphic genetic loci or strain 
identification may be generated by the probe selection technique 200 and processing 
arrangement 10. The optimal probe set 300 may then be arranged on one or more 
10 microarray devices to determine the genotype or haplotype of polymorphic genetic 
loci or identify the strain present in a DNA sample. 

Another exemplary embodiment of the present invention further relates 
to the optimal arrangement of the optimal set of oligonucleotide probes on the one or 
more microarray devices. The exemplary arrangement of the probes may be based on 
one or more "conceptual" measures pf probe distance. The "conceptual" probe 
distance can use available biological knowledge about the portions of the genome 
containing the probes in question, as well as a measure of competition between these 
probes, as described in Chapter 3 of Cherepinsky, Ph.D. Tliesis, New York University 
2003. An exemplary recursive technique according to the present invention can 
facilitate a separation between the "nearest" probe pairs by at least a specified 
minimum physical distance on the chip. This optimized arrangement can be designed 
to be disruptive to neighborhoods, defined with respect to the "conceptual" probe 
distance. Thus, the exemplary arrangement can place conceptually nearby probes far 
apart on the surface, thereby minimizing the likelihood for competitive interactions 
between neighboring probes. 

Figure 2 illustrates a first exemplary embodiment of the probe 
selection algorithm 200 of the present invention for determining the optimal set of 
oligonucleotide probes. In this exemplary embodiment, the biological problem to be 
solved, e.g., designing an optimal set of probes to be used on a microarray for 
30 determining the genotype or haplotype of polymorphic genetic loci, such as the 
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human HLA region, or strain, variation may first be translated into a graph 
optimization problem 210. The graph optimization problem can then be solved (item 
220) through an application of the exemplary technique according to the present 
invention. The graph optimization solution 230 can then be translated back into the 
5 context of the biological problem to be solved. 

Figure 3 shows a second exemplary embodiment of the probe selection 
technique (see process 200 of Figure IB) according to the present invention for 
detennining the optimal set of oligonucleotide probes, in which the exemplary 
embodiments of the processes 210 and 220 of Figure 2 are further illustrated. In this 

10 exemplary embodiment, selected target genotypes and/or haplotypes, which may 
include all known genotypes or haplotypes at one or more predetermined polymorphic 
genetic loci or a subset thereof, depending on whether pre-processing has been 
performed, may be used to generate potential probe sequences 211. The potential 
probe sequences 211 can then be used to generate probe response vectors (PRVs) 212. 

15 Using the PRVs 212, a complete edge-weighted and vertex-weighted graph G=(VyE) 
213 may be constructed and the algorithm parameters estimated and set. Said 
parameters include the following: M, the upper bound on the size of the independent 
set sought; a, the scaling factor used to modify boosting weights of vertices on each 
iteration of the algorithm; p, the edge threshold; parameters minRestartNum and 

20 nolmprovements, used to set loop termination conditions; as well as unnamed 
parameters such as the size of the "current-best" list of independent sets. Criteria for 
estimating M are obtained via probabilistic analysis, discussed in section C.6 of 
present application. Other parameters are chosen by trial and error. 

Once the graph G is constructed, the exemplary technique of the 
25 present invention may be applied to identify one or more optimal subsets of PRVs 
(step 221), which are output from this exemplary technique. In a post-processing 
procedure, one or more of the candidate optimal subsets 222 can be selected for a 
maximum discriminatory power by testing allele coding vectors for redundancy. 

The PRVs present in the optimal subset may then be converted back 
30 into probe nucleotide sequences by reporting the DNA sequence associated with the 
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probe used to generate each PRV contained in the optimal subset (item 231). This 
exemplary procedure may be equivalent to translating the solution of the graph 
optimization problem back into a biological context. 

OVERALL PROCESS DESCRIPTION ACCORDING TO THE PRESENT 
5 INVENTION 

A. MATHEMATICAL FORMULATION 

1 . Definitions 

Let the different HLA types, or alleles, be denoted by T Jt j = 1,.. JV. 
(Here, N = 1750, the approximate number of known HLA types.) Let a given 
10 microarray be denoted by juA ki ke M , where a microarray is defined by a set of 

hybridization probes and their two-dimensional arrangement on the chip surface. The 
process of querying the given DNA sequence (hereafter referred to as a "target" 
sequence) by hybridization can be denoted by the expression 

( Tj>M A k )^ D^fj (1) 

1 5 where 7} is the true allele contained in the target sequence, juA k is the microarray used 
in the query, D is the data output of the hybridization experiment, and 2} is the allele 
inferred from the data. Both processes in (1) are described below. 

The problem of HLA typing can then be formulated as that of 
designing the best microarray, namely, the set and arrangement of probes, which 
20 "works" for all known HLA types (z.e.,Vy). In the present notation, this can mean 
finding juA^ which solves the optimization problem 

minX^^^fJ 

/ (2) 
<=> min^W; ^[Tj^Tj). 

type] 

Here, nx is the indicator function 
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fl, if X is true) 
x [O, otherwise J 

and Wj is the weight assigned to type y. Initially, Wj can be set such that Wj =1 V/ . At 

a later point, it may be desirable to weigh different HLA types differently, based on 
the frequency of their occurrence in human population or some other criteria. 

5 There are several procedures that should be considered in detail: 

obtaining data D from an experiment based on allele 3} and microarray yAk (described 

below in Section A.2), inferring allele fj from the data (described below in Section 

A.3), generating potential microarrays for the typing experiments (described below in 
Section A.4), and selecting the optimal microarray (Section B). 

10 2. (T p vA h )->D 

Consider the set of probes {Pi P w ^, a} constituting microarray juAk, 
initially neglecting their arrangement. Ideally, the outcome of the hybridization of the 
target sequence with each probe P would be binary: 1, if the target contains a 
subsequence complementary to P, and 0, otherwise. Using n = n(k) probes then 

1 5 yields a binary string of length n, or, alternately, a vector of length n, as a code for the 
target sequence. In practice, hybridization results may not necessarily be binary. 
In particular, the measurements are the analog intensity values corresponding to the 
amount of formed probe-target complex for each probe. In addition, in an attempt to 
"factor out" the non-specific signal, each probe can often be present in two versions: 

20 one (e.g., a perfect match, or "pm") perfectly complementary to a region on the target, 
and the other (e.g., a mismatch, or "mm") slightly mismatched, the latter usually 
containing a single base mismatch near the center of the probe. This is the case, for 
example, in Affymetrix GeneChips. In such setup, the signal from probe P may be 
the match-to-mismatch ratio, i.e., the ratio of the intensities corresponding to the 

25 matched and mismatched probe-target complexes. Furthermore, the signal can be 
log-transformed, so that the hybridization outcome for probe P is really the value of 
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TP nm 

pm 

\ TP mmJ 



The situation may further be complicated because the probes may 
hybridize to positions on the target other than those they were designed to detect - 
this is known as "cross-hybridization." In addition, the fact that many probes are 
5 present in the system may cause the signal {i.e., the hybridization outcome) from a 
given probe to differ from the signal of the same probe in the absence of other probes. 
This is described in more detail in Cherepinsky, Ph.D. Thesis, New York University 
2003, Chapter 3. 

Thus, the actual result of a hybridization experiment is a vector of n measurements, 
10 De r. 

3. D -» fj 

To infer the allele from the n-vector, the ideal process should be 
referred to again, where the outcome is Dg {0,l} n . If the probes are selected in such 
a way so as to provide a distinct binary string for each known allele (so that the 

1 5 Hamming distance du between any pair of data vectors D iy Dk is at least 1), then these 
n probes can be sufficient to identify the allele of the target sequence. What is 
preferred is to query the sequence with the n probes and read off the allele to which 
the pattern corresponds. Furthermore, if it is required that 
d H (D i9 D k ) >a forsomea >1, then the discrimination power can be increased and 

20 error-correction is possible. This is discussed in greater detail below (e.g., Section 
B.6). 

In a practical setting, De R n . Thus, as a first procedure, some 
thresholding process must be applied to D to reduce it to a binary string. 

4. Generating Potential Microarrays 
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a. Selection of Informative Probes. A set of n probes, each of length Z, that 
are at least d letters apart (pairwise), must be provided for optimal discrimination 
among the allele, sequences. 

If L is not specified, it can be chosen arbitrarily (say, 20), or allowed to 
5 vary from probe to probe. 

With no restrictions, a very large n can be chosen; for example, every 
possible 20-mer could be used as a probe. However, this would result in 4 20 = 2 40 = 
(2 10 ) 4 > (10 3 ) 4 = 10 12 , or over a trillion, probes. Such a large set may not be desirable, 
since many of these probes would give the same results, and it is too expensive to 
10 produce all of them. Allowing both n and L to vary may give an even larger number 
of potential probe sequences . 

The probe design problem relates to selecting which of the probes are 
most useful in discriminating among the given allele sequences, and how many (or 
rather, how few) one can use appropriately. 

15 b. Arrangement of Probes on the Chip. When a set of probes has been 

selected using techniques described herein above, a question still remains of how to 
arrange these probes on the microarray chip. Several studies indicate that the patterns 
observed in the results of chip experiments may be due to the arrangement of probes 
on the chip. See e.g., Kluger et al y submitted to Nature Genetics, 2004 at URL 

20' http:^ioinfo.mbb.yale.edu/~Muger/pipeline/KLUGERetal_NG.pd^ Yu et al, 
submitted to Nature Biotechnology, 2004 at URL http://bioinfo.mbb.yale.edu/ 
-kluger/pipeline/YQKJNB.pdf; and Qian et al. 9 submitted to Biofechniques, 2004 at 
URL http:/^iomfo.mbb.yde.edu/~kluger/pipeline/QYK_artifact.pdf. In particular, it 
has been observed that the probes are arranged on a chip based on the labels of the 

25 genes they represent, and a gene label is often related to the function and/or disorder 
in which the gene is involved. As a result, genes of shared function have similar 
labels and are coexpressed, generating monochromatic bands on microarray chip 
scans. 

13 
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These studies indicate that additional consideration should be given to 
the arrangement of the probes on the chip, based on certain "conceptual" measure of 
probe distance. The "conceptual" probe distance can use available biological 
knowledge about the portions of the genome containing the probes in question, as 
5 well as a measure of competition between these probes, as discussed Chapter 3 of 
Cherepinsky, Ph.D. Thesis, New York University 2003. The recursive technique 
described below can ensure that the "nearest" probe pairs would be separated by at 
least a specified minimum physical distance on the chip. It is designed to be 
disruptive to neighborhoods, defined with respect to the "conceptual" probe distance. 
10 Thus, the exemplary technique places conceptually nearby probes far apart on the 
surface. 

Consider a bijective function f: {0,..., AT 2 -1} -» {0,...,N-1} x {0,...,iV- 
1 } that maps every pair of "nearby" points in the domain space to a pair of "distant 5 ' 
points in the range space. In particular, the following devised function / with the 
15 following property should be considered: For every x f y, if 
\x-y\ < 4 a , then||/(jc) - f{y)\ x >N/(l a+l ). This function likely gives an optimal 

placement. If the elements of the domain space satisfy other distance properties, this 
technique can be suitably generalized to handle similar properties with respect to the 
new distance metric. 

20 This function can play an important role in determining how to place a 

set of oligonucleotide probes on a microarray surface in such a manner that if two 
probes are close to one another in their genome locations then they are reasonably far 
apart on the array. Thus, a placement determined by the function can minimize the 
competition among the probes for the genomic targets, as well as the systematic 

25 biases in the error processes. 

Inductively, a uniform family of functions fk may be defined as 
follows. Let k<lgN. 
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f M : {0,-.,N 2 

{0 > ...,A^-l}x{0 J ... ) iV r -l} 
:x»(i,j). 



f k+ i is defined in terms of f k , 



/t :{0,..,^V4-l}-^ 

{0,...,tf/2-l}x{0,-,iy/2-l}, 



as follows: 



/wW- 



jr*-0mod4 
/xH + (0»f). *7*-lmod4 
fM + (f' 0 )' '/^-2mod4 



(3) 



This function can be generalized, without its general properties being 
affected, by including a random permutation n M : {0, . . . , 3} -> {0, . . . , 3} as follows: 



/*«(*)= 



f&A <r*-*«(0) mod 4 

/.fell + ( 0 'I> (0 mod 4 

+ (l»I> ^-*«(2)inod4 

+ (s.i> (3) mod 4 



Herein below, & may take the value (IgAM), and the base case is given 



10 by the function 



where 



f 2 :{0,...,15}->{0,...,3}x{0,...,3} 



0i->(0,0) li->(0,2) 2h*(2,0) 3i-><2,2) 

4h»(0,1) 5^(0,3) 6 h-> <2,1> 7h>{2,3) 

8t->(l,0) 9h->(l,2) 10h-»<3,0) llh>(3,2) 

12 H>(1,1) 13h>{1,3) 14h>(3,1> 15 (3,3) 



(4) 



This base map can be described in matrix format as follows: 
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15 



0 4 15 

8 12 9 13 

2 6 3 7 

10 14 11 15 



(5) 



Taking N=2 e 9 N 2 =4 e probes can be placed by applying/, : {q,...,4'-l}, which 
after i-2 recursive steps, defined in equation (3), may reduce to the base case^: 
{0,...,15} shown in equations (4) and (5). 

5 Function can have the following distance properties. Let D{i, j) be 

the distance between probes p t and pj, when arrayed on a line (by relabeling the 
probes, one can view this as the index separation \i- Let d(i, j) be their distance 
when arrayed on the surface. Then, the mapping ft may guarantee that for all pupj for 
which £>(/,y)<4\ d(ij) > 2 e ' k '\ where k=0,...,£-l. Furthermore, if d(i, j)=l, that 
10 is, pi is placed next to pj on the surface, then D(i 9 j) > 3 • 4 e ~ 2 . 

For example, let I = 3. There are N 2 = 4 £ = 64 probes, which are 
placed by according to 
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likely satisfy the condition that if D < 4 k , then d >N/2 M = 2 e ~ k ~ ] . 

The problem of an automatic generation of probe sets for DNA 
microarrays is described in Krause et al., Second IEEE International Workshop on 
High Performance Computational Biology (HiCOMB 2003), online proceedings at 
5 URL http://hpc.eece.unm.edu/HiCOMB/proceedings.htail. However, the work 
described by Krause et al. aims for a probe set that is, even in ideal circumstances, 
asymptotically much larger than the one generated by the exemplary embodiments of 
the present invention. 

Other biological problems, such as identifying an unknown pathogen 
10 as a member of a list of known pathogens, be they viral (see Rash and Gusfield, in 
Proceedings of the Sixth Annual International Conference on Computational Biology 
(RECOMB f 02), ACM Press, pp. 254-261) or bacterial (see Borneman et al., 
Bioinformatics 2001,17:S39), likely have the same mathematical formulation as the 
problem of HLA typing discussed here. Those applications can also benefit from the 
15 improvements to existing approaches provided by the exemplary embodiments of the 
present invention. 

B. DEFINITIONS 

The exemplary problem of selecting the optimal microarray is 
described below. The problem of selecting the constituent probes can be reduced to a 
20 "best independent set" problem. The following sections can define the graph model 
employed, as well as the meaning of the term "best independent set," and describe the 
optimizing algorithm. 

1. Notation 

There are N known alleles, and n potential probes. Each probe can be 
25 described by a "response vector" v y e{0,l}", y=l,...,/i. The response vector data 
can be represented in tabular form: 
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v 2 • 


- \ 


HLA X 


1 


0 • 


■ 1 


HLA, 


1 


1 • 


■• 0 


HLA N 


0 


0 • 


•• 1 



Column j is the response vector for probe v y : 

yj =(v 7 [l],,v y [2],... > v ; [iV]) 7 - 5 (7) 
and row i is the code for allele i, which we will call HLA,-: 

5 HLA t =(vi[i], v 2 [i],...,v„[i]) (8) 

2. Original Graph 

Each potential probe can form a vertex in the graph. Conceptually, an 
edge in the graph should connect two probes with shared characteristics. 

First, essentially a complete edge- and vertex-weighted undirected 
10 graph G = (VJE) on n vertices is constructed, where n is the number of potential 
probes. In a general problem formulation, n can be very large. For each probe length 
L, there are likely 4 L possible probes. For instance, as shown in above in Section A.4, 
there may be over a trillion possible probes of length 20. Thus, the graph in the probe 
interaction model can be very large. 

15 Second, weights are assigned to each vertex v and to each edge e, 

0<w(v), w(e)<l. The weight of a vertex can be initially set to the "information 
content" of the corresponding probe response vector (PRV) with respect to the HLA 
typing problem: 

w (v) = min (percentage of 0' s, percentage of Y s}/l 00. (9) 

20 The term "information content" can be used here differently than defined in 
information theory, where it may mean the minimum amount of information needed 
to send a string. See e.g., Cover and Thomas. Elements of Information Theory. John 
Wiley and Sons, New York. 1991. Ideally, if possible, all vertices should have 
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weight 0.5. A vertex with a weight too close to zero can be uninformative, and the 
corresponding probe should only be used if it serves to differentiate an allele that is 
not distinguishable by using other, more informative probes. 

The weight of an edge is initially set to the scaled Hamming distance 
5 of the probe response vectors represented by its endpoints: 

w(e) = Hamming distance/vector length (10) 
with values close to zero corresponding to sequence-similar probes. 

3. Thresholded Graph 

Third, the graph G is transformed by thresholding the edges. A 
10 threshold p, is selected and used to generate a modified graph G mo d = (V, E mo(i ) 9 where 

E mod = {eeE:w(e)<p} 

is a set of unweighted edges, and the set of weighted vertices V is unchanged. The 
choice of threshold p will be discussed in further detail below in section D. Hereafter, 
this modified vertex-weighted graph is employed and denoted by G. 

15 An independent set can be defined on the graph. An independent set is 

defined as a set of vertices such that for any pair of vertices, there is no edge between 
them. See Dictionary of Algorithms and Data . Structures at NIST at URL 
http://www.nist.gov/dads/. In particular, a set of vertices V x c V can be an 

independent set if (u e V 1 and v e F ) implies that {w, v } g E. 
20 4. Exemplary Goal 

The best microarray can be defined above in Section A. The 
corresponding concept is now formulated on the graph model The best independent 
set can be defined as a maximum weight yet minimum size independent set. Thus, 
such a set S czV must be an independent set, have maximum weight 

25 w(s) - ves w ( v )> 311(1 minimum cardinality | S | . The condition of independence 
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is meant to preclude any unintended interaction among the chosen probes. Maximum 
weight provides S with maximum discrimination power. Minimum size likely ensures 
that the smallest collection of probes is used to perform the preferred functions. 

Since all vertex weights are nonnegative, the requirements of 
5 maximum weight and minimum cardinality are clearly contradictory. The definition 
can be relaxed somewhat by specifying a priori the desired size M of the set, and 
looking instead for the maximum weight independent set of size <M 

C. OPTIMIZATION PROCEDURE 

To achieve this goal, a modification of a Maximal Independent Set 
10 procedure, described in Luby, Proceedings of the Seventh Annual ACM Symposium 
on Theory of Computing (STOC '85), pp. 1-10, is utilized. 

1 . Procedure Pseudocode 

Given graph G and set size M: 

' a) Initialization: 
15 (i) Initialize a "current-best" list of independent sets, with 



associated information weights. It stores a list of the 
best, say, 20, independent sets seen so far, sorted by 
information weight. 



20 



b) 



Restart Loop: Execute at least minRestartNum times; if the 
"current-best" list is not full (i.e., does not have 20 independent 
sets) by then, keep repeating until the list is filled. 



(i) 



Initialize boosting weights; this was accomplished by 
setting the boosting weights to the information weights 
of the vertices: 



25 



Vv e V, w b (v) <-w(v). 
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(ii) Boosting Loop: Repeat until no improvements have 
been made to the "current-best" list for a fixed number 
of iterations (say, five iterations). 

aa. Choose a set S of M vertices randomly from V 9 with 



bb. For each edge {u,v} in G\s (the induced subgraph on 
5), eliminate one of the endpoint vertices. This leaves a 
set, S\ 9 of K < M independent vertices. 

cc. Adjust the boosting weights of vertices in S\ for 
example, increase the boosting weights of the vertices 
in S\\ 

w b ( v ) <- a w b (*) Vv € s \ 

and decrease the boosting weights of the vertices in 
S-S x : 



(v) <r- - W b (v) VVGS" 5, 



where a > 1 is some previously selected constant. 

dd. If 1S1 is not already in the "current-best" list and provides 
an improvement over some current member of the list, 
reset the nolmprovements counter, locate an appropriate 
location for S\ in the list, and update the list. Otherwise, 
make a note that no changes to "current-besf 5 list were 
made on this iteration (z.e., increment the 
nolmprovements counter). 
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(iii) If the condition for continuing the restart loop holds 
(namely, minRestartNwn restarts have not yet been 
executed or the "current-best" list is not yet full), reset 
the nolmprovements counter and repeat step (b). 
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2 . Procedure Description 

The procedure can utilize vertex boosting weights (e.g., initially set to 
probe information weights) to define a probability distribution on the vertex set. On 
each iteration of the boosting loop (step b.ii above), a random subset of a specified 

5 size can be selected according to the current probability distribution (step b.ii.aa). All 
edges in the induced subgraph on this random subset may be broken, with one of the 
terminal vertices removed (step b.ii.bb). The boosting weights of the elements of the 
subset can then be modified (step b.ii.cc), so that the vertices that stayed in the subset 
are more likely, and the vertices that were removed are less likely to be selected on 

1 0 the next iteration. The boosting loop can terminate after a certain number of iterations 
with no improvement to the list of top independent sets (to allow some flexibility, the 
algorithm keeps track of several of the top independent sets instead of only storing the 
best one seen so far). 

The procedure can also restart the boosting loop several times with 
15 original probe information weights. This feature can be used to prevent convergence 
to a local optimum, which is possible for high values of the boosting factor. 

3 . Detailed Explanation of Procedure 

The boosting procedure (as a whole) can be viewed as operating on the 
probability space of all subsets of our graph. Step b.ii.bb provides that the selected 

20 subset is independent, so that the probability distribution can only be supported on 
independent sets (i.e., the distribution is zero on all non-independent sets). In this 
view, the procedure can converge to a probability distribution where the best 
independent set has the highest probability. Each iteration of the boosting loop 
adjusts the probabilities associated with each vertex in the graph. The subset of 

25 interest is always drawn randomly according to the current probability distribution. 

If the solution S* is known a priori, its selection by the procedure can 
possibly be guaranteed by initializing the boosting weights in step b.i to be 
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Vv e S*, w 6 (v)<^ 1, 
VveK-S*, w 6 (v)<-0. 

In other words, the associated probability distribution may have a 
probability of 1 for each vertex v e S* , and a probability of 0 for each remaining 
vertex v e V- S*. 

5 Given an unlimited time for obtaining a solution (and an appropriate 

set of parameters), the boosting procedure can converge to this ideal distribution. 
However, when time is limited, a "good" (i.e., informative) independent set of size < 
M is only "more likely" than other independent sets of similar size. The procedure is 
provided to give an effective solution in a limited time, and yet be able to improve on 

10 it iteratively when more time is permitted (with minimal modifications to the loop 
terminating conditions). 

For example, the best independent set may be, ideally, a fixed point for 
the algorithm, in the sense that if the procedure starts at a perturbed location in the 
subset probability space, it should converge to the optimal set. In particular, if the 
15 initial probability distribution is heavily favored towards a set that does not differ 
from the best set in many vertices, the procedure likely converges to the best set. 

4. Breaking Edges 

In step b.ii.bb of the boosting procedure can keep the vertex that has 
the higher boosting weight. If vertices have equal boosting weights, one at random 
20 (with probability X A) can be selected. 

5 . Selecting Scaling Factor a 

In step b.ii.cc of the boosting algorithm, the weights of those vertices 
that were selected and kept are boosted (scaled up) by a factor of a > 1, while the 
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weights of discarded vertices are scaled down by the same factor. This has the effect 
of noting which vertices are selected for membership in the independent set, and 
increasing the likelihood that these vertices will be selected in the future, with the 
reverse effect on the discarded vertices. The manner in which the value of the scaling 
5 factor a affects the "memory" of the probability space evolution is discussed below. 
A single "restart" of the procedure (namely, step b.ii) is described. 

a) Extreme cases: 

• a = 1: No memory of previous selections. Ignore the current 
selection, and choose anew on the next iteration. The boosting 

10 procedure performs an exhaustive search. 

• a = oo : Perfect memory. Once a set S of vertices is selected and 
pruned, and its elements' boosting weights are modified, each of 
the vertices remaining in the independent set S\ will have a 
boosting weight of oo and each of those thrown out of the 

15 independent set due to conflicts will have a boosting weight of 0. 

Thereafter, the boosting procedure will always choose the 
independent set selected on the first run. 

b) Real values: 

The boosting procedure is executed on the same graph model with 
20 several values of a e {2, 1.5, 1.2, 1.1}. The executions with higher values of a were 
observed to terminate a single "restart" after a smaller number of iterations than those 
with lower values of a. Values of a can be chosen by many methods, known to those 
with ordinary skill in the art. 

25 6. Choosing M\ the Maximum Size of the Independent Set 

This section contains a probabilistic analysis of an answer to the 
following question: What are the bounds on the number of probes, k, that is sufficient 
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10 



to distinguish ATknown alleles? In order to answer this question, certain assumptions 
be made on the random distribution from which the known alleles are assumed to 



can 
be drawn 



a) Similar PRV Entries 
5 Assume that each probe, at each index i = 1,...,M assumes values 0 

and 1 independently and with equal probability. Consider k such probes and two 
alleles (HLA/ and HLA m ). Thus, if HLA/ is fixed: 



then for each/', 



HLA l = (HLA,[l],...,HLA i [fc]) 1 



Pr (HLAt»t?| 8=1 HLAjt?]) »'i 
Pr(HLAro[7] ^ HLA^'J) - | 



then, for these two HLA vectors, 

Pr (The Hamming distance between 2 HLA vectors = x) 



(11) 



which can easily be seen as follows: 



Pr (The Hamming dist bet'n 2 HLA vectors = a?) 
bx Pr(HLA vectors differ in exactly x positions) 
/ x successes in A; Bernoulli trials, 
where 

success = {HLA m (j] ?4 HLAi[?]} 
\ and p = Pr (success) = | 
*-» /ia 

i-fc 



15 

Thus, for a fixed pair of alleles, 



- Pr ^ 
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Pr(*>l) = 1- Pr(s = 0) = 1 - (by (11)) 

- 1-2"*, (12) 

and 



pairs 

= U(l-2- fc ) = (l-2-*)#' jaitti (by (12)) 

. - (l-2" fc )(») (13) 

since there are N distinct allele vectors and pairs are unordered. 

.5 This probability ideally should be greater than (1-e) for some fixed 

small 0<e«l, i.e., 

(14) 

First, the left-hand side term is bound: 

(l-2-*)(?) . [(l-2-*) 2t ] C ^ 

. > { & ) 

= e 



1 0 where the inequality comes from the bound (appendix A) 

> e * for large n. 



(15) 

For the inequality in (14) to operate, the bound (15) should be in the correct direction. 
Suppose a > b is desired. If instead a > c is demonstrated and the parameter is 
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selected such that Ob, then it can be concluded from a > c> b that a > b. Therefore, 
the above inequality chain would work if found (15) holds. 

Hereafter, the symbol is used to indicate the steps in the inequality 
reduction that can satisfy the previous statements whenever the parameter in question 
5 is selected to satisfy the current statement. 

Thus, an inequality (14) can be reduced to the following: 

-(5) 

e > 1 — € 



(l + 2-*)2- ft >In(l-e) 

(16) 



Next, consider the right-hand term: ln(l - x) < - x for 0 < x < 1 . Again, a > b can be 
desired. If b < d is demonstrated and the parameter is selected such mat a > d, it can 
10 be concluded from a > d > b that a > b. This permits the reduction of the inequality 
(16) to the following: 



+ 2 -*)2- fc >-e 



4=> £> ' " 



(2) (l + 2-*)2- 



(17) 
(18) 



since 



(l + 2- fc )2- & = (2 fc + l)2- 2fe = (2 fe + l)4- fc . (19) 

Furthermore, 

0+1 £ + 1 

P + 1 (20) 
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Hence, taking = 2* yields 



4 fc 



2* + l 



> 2 k - L 



(21) 



Thus, (18) follows if kis chosen to satisfy 



2*-l> 



2*> 



+ 1 



(22) 



5 The remaining chain of inequalities, from "desired" to "obtained," can now be 
verified. For example, the inequality (17) can be extended 



which in turn can be extended 



- (^j (1 + 2~ fc ) 2- k > -e > In(l - c) 
-(?){i+2- fc )2-* 



/, -(?)<l+2- fc )2- fc 

(1 -2 fc )U> > e ■ > 1 -e 

(1 - 2"" fc )(" ) > 1 - c, as desired. 



10 Therefore, A: (given e , N) is selected to satisfy (22): 



2*> 



+ 1 



A simpler bound on k can be obtained by imposing a stronger condition 



went 



2* > (2/e) 



0- 



(23) 
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which implies (22) since (l/e)j^j>l. The right-hand side of equation (23) 
simplifies to 

(2/e) (*) = (2A)^y^ = {1/€)N{N - 1), 

so that 

&>igiV + l g (jV~l)~Ige 
5 (24) 

is equivalent to equation (23). Furthermore, since 2 \gN> lgN+ lg (N- 1), selecting 

fe> 2lgN- \$€ 
certainly gives a value ofk that satisfies (23). Therefore, requiring 



k > 21giV-lge 



(25) 



1 0 imposes the strongest condition of those listed above. Hence, a value of k that satisfies 
equation (25) also satisfies equation (22), and therefore the original desired inequality 
(14). 

b) Dissimilar PRV Entries. A violation of the similarity 
assumptions may be modeled by an error term 5. Suppose a probe fails to contribute 
15 to a Hamming distance with probability (1 + 8)/2. As discussed previously, each 
position of the HLA code vector can be considered as a Bernoulli trial, where success 
is defined as the event that / h entry of a code vector contributes to the Hamming 
distance, i.c. f {HLA,„[/] * HLA/[/]}, so that 

q » Pr (failure) = (l + *)/2 
p = Pr (success) = (1 - S)/2 

20 Therefore, 
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Pr (The Hamm dist = x) 



(l-<S)*(l + 5) fc - x 2- 
^ (26) 

Continuing as before, the following condition can be obtained. 



Pr(x > 1) - 1 - Pr (x = 0) 
- l- (*)(l-fi)°(l+5) fe -°2 
1- {1 + ^*2-*, 



(27) 
and 

Pr (Vp ai « a; > 1) — (l — (1 + djV*)® . 

5 

This probability should be bigger than (1 - e). In other words, 

(iHl + M^Tl-e 

• * . i (28) 

♦ LHS(28) > • e 
want „ 

* 1_€ (29) 



10 
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where the last transformation is obtained as in equation (19), replacing 2 by 2/(1 + S). 
The same substitution in (21) (i.e., taking ^= (2/(1 + S)f in (20)) yields 



(*) 



2k 



\ 1+5 ) 

Thus, equation (30) follows ilk is chosen to satisfy 



VJfeeN 



(31) 



(32) 



Again, a simpler bound on k can be obtained by imposing a stronger condition 



(_^) fc T(2A)(^) = (l/e)^-l) 



(33) 



so that 



*[(t£j)1-*< i -« 1 + 



> lgN + lg(JV-l)-lgd 

ft(l-lg(l + <5)) >21g/V-lgfi. 



(34) 



fc > ( i-i 6 ttW 21s "~ Ige] 



10 (35) 

c) Non-unit Minimum Hamming Distance. Further, the preferable 
size k can be estimated for (almost) any desired minimum Hamming distance between 
allele code vectors. As demonstrated above, the Hamming distance between a pair of 
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HLA vectors is a binomial random variable x ~ S (n, p) where # trials s n = k, Pr 
(success) =/>=(!- 5)/2, and Pr (failure) ■ q = (1 + 5)/2: 



Pr(a;) = Q(I - 5)*(1 + 5) & - s 2-* 



Its mean is wp = & (1 - <5)/2 and variance is npq = k(l- The following estimate 
5 can be obtained (using Chernoff bounds) : 

Pr(s<fc(l-*)/4)< e 

(36) 

Chemoff inequality states (see Appendix B for the proof): 

— 4-7W? 

Pr(S(n,p).<(l-A)np)< e 3 . 

(37) 

n = k,p = (l-<5)/ 2, and let A = | . Then by equation (37), 

Pr(s<fc(l-5)/4) < e 2 2 
^S* 1 ? 4 -fe(l-*)/16 

Thus, it can be estimated for which k 

Pt (W* £ *(* - *)/<*) 2: i - «■ (38) 

From equation (36), it is possible to obtain 



Pr(*>*(l-5)/4)>l - e - fe ^- 5 >A6 



15 and hence, 
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Pr(V pairs a;>fc{l-a)/4) 
= Pr(sc£&(l-£)/4)(*) 

i-. ■ j 



want 
> 1-e 



(39) 



(expression (39)) 

>exp rUJl 1 " , " e J 6 / 



want 

> 1-e 



1+ e -fe(l-<5)/16^ c -*(l-4)/16 



> ]n(l - <0 

-© ( 

(see (17)) 

UK 1+e r 

-l>(l/e)Q (by (20)) 



.Jb(l-*)/16 

e > 



e > 



(2/0(2) = ( 1 AWN-1) 



(see (23)) 

&(1 - <5)/16 >\nN + ]n{N - 1) - In e 
ft(l-<J)/16>21niV-bie 



5 Therefore if, 



A> j~[21nJV-lne] 



(40) 
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then 



For this exemplary probe set (with k=k(e, N,d)\ an arbitrarily high probability can 
be obtained, by the selection of e in (38), that all HLA coding vectors have pairwise 
5 Hamming distance of at least k(l - <5)/4. Thus, this exemplary probe set is able to 
correct £(1 - S)/8 errors by choosing the coding vector closest to that obtained. 

The error term £ should then be estimated. This can be accomplished 
on a given set S of probes by sampling pairs {/, m} of indices on probes from S and 
examining the resulting 2-vectbrs on {0, 1 } . The probability of a failure to contribute 
10 to the Hamming distance (given by (1 + S)/2) can be estimated by the frequency f= of 
observing equal entries in the 2-vector, since each probe with equal entries in the 
2-vector fails to contribute to the Hamming distance between alleles / and m. 
Therefore, 

J-2/--1 (41) 

15 Let/i denote the frequency of observing unequal entries in the same setting. Thus,^ 
= 1, and it is possible to obtain 



(42) 



1 - 5 = 1 - (2/« - 1) = 2(W = ) 2/^. 
The bound in equation (40) becomes 

& > ;H~[21nW~ine] 

(43) 

20 Thus, to generate distinct coding vectors for all alleles (e.g., to 

guarantee a Hamming distance ddfiu cj) > 1 w.p. > 1 - e), it is possible to select M> 
k, where k satisfies (35) with £ estimated as in (41). It is also possible to choose M to 
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allow for error correction of up to D/2 errors (guaranteeing w.p. > (1 - e) a minimum 
Hamming distance d H {ci> cj) >D)\ set 

D~k(l-6)/4=kU/2 

in equation (38), so that k = 2D I U should satisfy equation (40), and, again, select M> 
5 k. 

D. PRE-PROCESSING 

1 . Initial Probe Selection 

In section B2, it was described that starting with all possible probes 
results in a graph that has many vertices. The discussion below provides certain pre- 
1 0 processing steps that allow the elimination of a large portion of this probe set. 

a. Probes that do not hit the HLA region on any allele. Many of 
the possible length-Z probes generally do not provide sequence-specific information 
about the target. As such, they may be safely left out of our probe selection process. 
This would allow for a reduction of the starting (perfectly matched) probe set to those 
15 probes that are complementary to a subsequence of at least one of the alleles. A way 
to obtain such set can be as follows. Assume that the allele sequences are provided in 
the 5 5 to 3 5 orientation. 

A window of length L can be considered along allele T\. Denote the length of 
the allelic sequence by len{T\\ and index elements of the sequence starting with 1, so 
20 that the entire allele sequence can be denoted by 

Ti[l)<<<Ti[lm{Ti)}> 

A probe complementary to the allele subsequence seen through the window 
can be constructed and placed in the set. The window then may be shifted by one 
nucleotide in the direction of the 3 '-end. This process can be repeated until the last 
25 window [k... (L + k- 1)] reaches the end of the target sequence: 
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L + A-l = len{T t ) 

k ~ len(Ti) - L + 1, 

thus, generating a set of (len{T\) - Z + 1) probes, each perfectly complementary to T\. 

The procedure described above can generate all probes of length L that 
are, e.g., perfectly complementary to a length-Z, subsequence of the target (i.e., allele) 
5 sequence. Depending on the form in which the allele sequences are given, it may also 
be desirable to include probes corresponding to windows that are partially shifted off 
the a|lele, i.e., windows showing a portion of the given allele sequence together with 
the corresponding 5 '-tail of the sequence, if the window is shifted to the left, or the 3 
tail, if the window is shifted to the right. There are 2(L - 1) such probes, 
10 corresponding to indices 

[(/rnCTO - L + 2) . . . (Un{Ti) + . . . , 

[ten(T l )...(ten(T 1 ) + L-.i)} 

for the right-shifted windows and "indices" 

[0--{^-i)],... J [(2-L)..,l] 
for the left-shifted windows. 

This process can be repeated for the other alleles r 2 ,...,7V. To avoid 
placing duplicate probes in the set, a generated probe can be added to the set if its 
sequence is not already present. Alternately, the duplicates can be weeded out 
subsequently. It should be noted that this may have the added advantage of 
eliminating probes hitting sequence repeats. 

The above-described exemplary process has the effect of eliminating 
probes that hit genomic sequences outside the target region, including probes that hit 
introns (if allelic sequences are provided in genomic DNA form) from the original 
collection of all possible probes of length L. The resulting set may contain only those 
probes complementary to subsequences of the HLA region, or sub-words of the pool 
of all allele sequences. 
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b. Non-informative probes. In the set created as described in the 
previous section, some (and perhaps many) of the probes would not be able to give 
any information useful for distinguishing among the alleles. These are the probes that 
may be drawn from the windows that are shared among the alleles - they hybridize to 

5 a common subsequence of the alleles. Any such probe may be useless for 
discriminating alleles: to such a probe, all alleles will look alike. Therefore, these 
probes can be safely eliminated from the potential probe set. 

To locate all such probes, it is necessary to find the common subsequences of 
length > L of all the alleles, identify the probes complementary to these subsequences, 
10 and remove these probes from the set. This can be done as the next step in the 
"refinement' 5 of the starting probe set, and/or included as a condition in the process 
for probe addition specified in the previous section. 

c. Potential for cross-hybridization. The probes that are likely to 
hit multiple sites on the target sequence(s), such as those hitting a repeated region, 

15 should be eliminated, as is usually done in microarray design, as their use is likely to 
produce a high level of noise. Each probe is usually expected to have a unique site on 
the target. See e.g., Lockhart et al 9 Nature Biotechnology 1996,14:1675; Kaderali 
and Schliep, An algorithm to select target specific probes for DNA chips at URL 
http://citeseer.nj.nec.coin/kaderali01algorithm.htird; Li and Stormo, Bioinformatics, 

20 2001,17:1067. 

2. Graph Generation 

a. Generating Probe Response Vectors. Once a set of initial 
probes is selected, as described above, a probe response vector (7) must be generated 
for each of these probes. To do that, given probe j 9 string-matching is performed on 
25 each of the N alleles for the Watson-Crick complement of probe j 9 and the results are 
used to set 

r ., / 1, if there is a match with alkie 2} 
v * m \ 0, otherwise 
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b. Choice of Edge Threshold. The edge threshold parameter p 
was used in section B.3 to transform the initial complete edge-weighted graph. Its 
value determines how many edges remain in the graph, as well as how "independent" 
each independent set on the graph really is. 

5 • If p is too small, there will be very few edges in the graph. Most of the 

random sets selected by the boosting algorithm will prove to be 
independent. However, upon examination in post-processing, 
described in section B.3, it may be found that many of these sets do not 
possess enough discrimination power to discern all N known alleles. 

10 • If, on the other hand, p is too large (e.g., p > 0.5), the graph will be 

very dense {i.e., have a lot of edges). The algorithm will then have a 
much harder time finding an independent set of large enough size. The 
output sets will likely contain much fewer than M vertices, and there 
may not be enough probes in the candidate sets to discern all N alleles. 

15 • A reasonable value of p is obtained by trial and error on a given set of 

potential probes. 

E. POST-PROCESSING 

The procedure (as described in section C.l) can return a list of 20 best 
independent sets, sorted by the total information weight of the constituent vertices. 

20 Each set may be composed of e.g., at most M vertices (probes). While the 
independence and maximum weight conditions can be selected to steer each selected 
set towards maximum discrimination power, this desired outcome may not be 
guaranteed. Thus, each of these best independent sets should be checked for 
redundancy of the allele coding vectors. Given a set S of probe response vectors, the 

25 N allele coding vectors generated by S (the rows in (6)) may be extracted and their 
pairwise Hamming distances rf#(c/,Cj), 1 < i <j < \ S | may be computed. If min^ 
d H {cu cj)=0, the set lacks discrimination power: at least two of the codes are the same, 
so the set will not be able to discern all known alleles. Such a set should not be used 
"as is", and should either be discarded or supplemented by additional probes. This 
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may indicate that the set is not truly independent, so the choice of edge threshold p, 
discussed in Section D.2.b, was inappropriate. 

It is possible to make the testing more stringent, in order to allow for 
up to D/2 errors in the data, as discussed in Section C.6. Those independent sets of 
5 the list of best sets that pass the redundancy testing (by satisfying min^y d^c u cj)>D\ 
in fact satisfy a definition stronger than that formulated for the best independent set. 
D-best independent set denotes a best independent set with the additional condition 
mhifij} dn(cj, cj)>D. 

Those sets that pass the redundancy test can be reordered by 
10 aveHamDist = ave^y d^c it cj)\ once the minimum allele code separation is 
guaranteed, the usefulness of a probe set to the HLA typing problem can be judged by 
the metric aveHamDist. 

F. INTERPRETING RESULTS 

Given the best independent set generated by the above-described 
15 procedure a determination regarding how it can be converted into a microarray for 
genotyping or haplotyping experiments must be made. 

In order to generate the microarray corresponding to an independent 
set of vectors yielded by the procedure (followed by post-processing steps from 
Section E), the DNA sequence for each probe, which was used to generate the probe 
20 response vector used in the analysis, should be recalled. The spatial arrangement of 
these probes on the chip surface can be decided as discussed above in Section A.4.b. 

G. ADDITIONAL APPLICATIONS 

Many extensions of the approaches presented herein are possible 
within die scope of the present invention. Two exemplary approaches are discussed 
25 below. 

1 . Extending weight functions in the graph model 
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The graph model discussed herein relies on the characteristics of the 
probe response vectors to define the weights of vertices and edges. While this model 
can generate certain interesting results, it can be extended to a more meaningful 
model by incorporating the physical properties of the probe sequences and their 
5 interactions, some of which are described in Cherepinsky, Ph.D. Thesis, New York 
University 2003, Chapter 3. 

In particular, the annotation of all potential probes with physical 
properties, such as melting temperature, free energy, entropy, and enthalpy of 
hybridization, for perfect matches and for closest matches in other alleles can be used 
10 to define cost functions that determine the weights. While the vertex weight may 
provide a measure of the performance of the corresponding probe in discriminating 
among known alleles, the pairwise probe interaction and the resulting competition 
effects, as "described in Cherepinsky, Ph.D. Thesis, New York University 2003, 
Chapter 3, can be reflected in the edge weights. 

15 2 . Pooling real data from previously tested chips 

Another extension of the procedure discussed herein involves the use 
of data from microarray chips used for HLA typing by different companies. Many 
biotechnology companies are working on the HLA typing problem in the hope of 
designing probe sets that give the answer more quickly and with greater accuracy. 
20 The sequences of the probes may generally be considered to be proprietary 
information and thus not shared. As a result, the collection of experimental data from 
testing the various probes in different combinations and arrangements on the 
microarray chips generated by different companies may almost never be examined as 
a whole. 

25 It is possible to employ the probe interaction model presented herein to 

make use of the aggregate experimental data. Suppose the following information can 
be obtained: a set of microarray chips along with some identifiers, if not the actual 
sequences, of the probes comprising each chip, and values measuring the performance 
of each chip in all previously conducted HLA typing experiments. That is, for each 
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chip, there is a list of unique probe identifiers and some measure of how well this chip 
performed in HLA typing. It may not be necessary to know the sequence of each 
probe, so long as the uniqueness of the identifiers can be verified by the company 
providing the data. . It is possible to combine the information from a large number of 
5 such previously tested chips to generate a plan for a new microarray chip (i.e. 9 a 
collection of probe identifiers and their spatial arrangement) with a performance value 
higher than that of "input" chips by the following process. The probe content and 
arrangement for each chip, together with its performance value, can be used to build 
the graph model. Vertex weights can be inferred from chip membership information. 
10 Edge weights can be estimated from conditional probabilities using pairwise 
membership information - that is, by considering two chips at a time, quantities such 
as the conditional probability that probe Pi was used on chip Q, given that it was used 
on chip Cfc <ian be estimated. Once the graph is constructed, the boosting algorithm 
can be used to generate the best set of probes, as discussed in Section C.l . 

15 All publications cited above are incorporated herein by reference in 

their entireties. 
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APPENDIX 

APPENDIX A: EXPONENTIAL LIMIT INEQUALITY: PROOF 



Claim: For large n, 

(x-i)%e- 1 -*. (A,) 



Proof: 

Inequality (AI) is equivalent to 



n 

Since the series expansion of tKe logarithm is given by 



(A3) 



we can expand the left-hand side of (A2) as follows: 



{by (A3)) 



= -n< 



°* 1 A 1 



V 1 1 1 / A A\ 



Thus, inequality (A2) reduces to 

111 want 1 

-•-s-w-*?— • > (A5) 



or, equivalently, 

3n 2 4n* x 5^ ' " ^ 2n n 2n 



111 wank 1 1 1 .... 

^2 + Z3 + -^4+--- < -5r + r = ^r (A6) 
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Now, 



1 _L_ 1 

3n 2 + 4» 3 + 5^ " *" 

3n 2 \ n n 2 ) 

_ _1 _1_ (geometric sura) 

3r> 2 ITT 

11 w <" i- (by.(A6» 
3n. n-1 27v 



Simplifying yields 



I wajjt 1 



3(n-I) 2 
2 < 3(tt - 1) = 3n - 3 

■e=* 5 < 3n, 

which holds for every n > 2. Retracing the chain of 
inequalities, we obtain 



H)" 



> e " 1 " Vn>2 



(A7) 



as desired. 



APPENDIX B: CKEKNOFFS INEQUALITY: 
PROOF 



Claim: 

Pr(5(n t |?)<(l-e)np)< e W > «€{0,1>. (Bl) 
Proof: 

5 is a Binomial random variable: 

S(n } p)=Xi + *-+X ny (B2) 
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where X( are M.dxv/s with . 
Therefore, 

E(S) - £ E (*i) - X> « ^ ( B4 ) 

Since 

S<(l-e)np (B5) 

A(S r- np) < -Amp VA > Q 
-A(5-np)> Aenp VA>0 

e > e VA>0, 

it follows that 

Pr (5 < (1 - e)np) = Pr ( e' HS ' n,y) > e^)(B6) 



E 
< — 



e 



i (by Markov's inequality) (B7) 



e 



For a proof of Markov's inequality, see, e.g,, (24]. 
Prom (B2) and (B3), we know that 

S - np - £ X* - np m *£(Xi » P) • (B 8) 

Therefore* 

E[e- A(S -" rt ]=E[e- iS ' <X '- , ' > ] (B9) 
53 n^[ e (by independence) . 

= { E ovoq, 
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* E[e j =pe +(l~P)e \ W (B10) 

= e A " (p e " A + (1 - p)) 6 Ap (1 + p ( e " A _ 1} ) 

< e f e J (since l+M<e v V«) 

= e < e , (Bll) 

where the Ia$t inequality follows from 

<T*<l-A + y VA>Q 
A 2 

e" A -l + A<^ (B12) 
Therefore, by (B9) and (Bll), 

„[ -MS-mo] / p£\ n ^np • ■ 

[ J I e J = ( B13 ) 



and 



Pr (5 < (1 - e)np) < e- Xmp E [ e'^^ 
< e . - e ; VA>0 (B14) 
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so that 



and, from (B14) 5 



Pr (S < (1 - e}np) < e~^" nP to € (0, 1) 



(B16) 



as desired 

It remains to chedc that the optimizing A = A* is a 
minimum of /(A), that is, /"(A*) > 0. By (B15), 

/'W - 7'((A) . np(A - e} + /(Ah np | A ^. 
-£(^^(0)'+/(A*)^p 
o 

~ npe >Q. 
.\ A* = e is a minimum. 
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